Code
prevalence = exp(linear_predictor) / (1 + exp(linear_predictor))With the advance of widely accessible public health and urban design datasets, people have even greater access to data from the general public, making large datasets detailing certain community health metrics open to the world . However, this data is often not easily accessible, making it quite difficult to peruse the millions of dataset provided by the government to figure out what exactly has caused a significant portion of the community’s health outcomes. Through this project, this data will be searched, parsed, and analysed through regression analysis and an average health metric index algorithm to give us a greater insight into what health metrics are the most significant in terms of community health outcomes.
Global urbanisation has been linked to increasing health problems in urban communities around the world, recognising the need for accurate and careful urban planning policy-making corburn_why_2012. In order to tie decision-making to accurate prioritisation of urban areas, especially on a localised scale, urban health indicators (UHIs) have often been used. UHIs are defined as tools that contribute to decision-making through quantitative representation of health inequalities and conditions in urban environments. Formally, a UHI is “a collection of summary measures about the physical urban environment’s contribution to human health and wellbeing”, which usually serves the purpose of depicting a “complex social, economic or physical reality” mullin_measuring_nodate. Measures typically relate to “quality of life, liveability, and wellbeing”. UHIs often serve multiple purposes in decision-making processes, including informing urban citizens, marking potential problems, and allowing for reasoned prioritisation. For example, the San Francisco Department of Public Health guides decision-making concerning land use and development, where a combination of government agencies and informed citizens developed optimal future plans for the city. Additionally, UHIs were utilized in Bristol, UK that effectively informed citizens and provided transparency and priorities for future city work. Various individual indicators contribute to a UHI — these indicators are typically certain patterns in urban planning seen to have significant correlation with health outcomes. Thus, we also make a distinction between UHIs and individual indicators — UHIs are quantitative representations of a broader concept, and are made up of individual indicators.
Numerous urban measures used in indicators have established correlations with public health outcomes. The amount of green space in urban areas has been linked to improved mental health and encouraged more physical activity, while also decreasing community violence and death rate. Implementation of supermarkets (sources of fresh food) has accompanied lower food insecurity rates, lower dependence on federal food aid, and improved overall well-being health-wise. Widespread community accessibility to public transport has been linked to lower rates of obesity, decreased pollutant and emission concentrations, and lowered other prominent cardiovascular and mental health ailments (Litman, 2016). While these indicators can have strong correlations with health outcomes, causality cannot be attributed to them, as these studies are observational and cannot establish causality. However, these studies highlight measured relationships between urban planning and public health that can be communicated and utilised to foster prioritised decision-making. Additionally, these are commonly utilized indicators in UHI development due to their ability to be measured, widespread availability of data, and accurate depictions of public health. It is also important to avoid considering these indicators to have direct impact on health outcomes, as indicators influence a variety of factors that contribute to health outcomes.
A purpose of many existing UHIs in literature is to simplify decision-making in a complex field (urban planning and its ties to public health), causing decisions to be streamlined and efficient. This simplification occurs by representing relationships and correlations as numerical values. However, in order to do so successfully, UHIs must logically account for complexities in the field; many UHIs did not explicitly mention a methodology to do so. Additionally, indicator selection for current UHIs appears to be influenced by the difficulty of obtaining such indicators, making data that is easier to measure and analyze more likely to contribute to UHIs.
While many UHIs have been produced in the past, methodologies often vary. Some indices centre around the summation of separate, constituent scores that are calculated on their own basis, usually as a standardised statistic higgs_urban_2019. This index methodology is present in the Urban Liveability Index (ULI), where various statistical measures, including density measures, ratios, and ordinal measures, were calculated independently, standardised to allow for equal and valid comparison, and summated. Indices similar to the Child Opportunity Index focus solely on standardisation, in which z-score positions across a wide array of indicators constitute a singular, combined index score acevedo-garcia_child_2014. Indices that are constructed with subjective data, including the Community Well-Being Index (A) in Korea, also utilise standardisation and z-score methodologies by quantifying opinions through psychometric scales like the Likert scale kim_development_2014. Polynomial estimation, especially at higher degrees, can also predict behaviours of certain indicators. For example, the HealthyCity Air-Noise Index utilises a third-degree polynomial to estimate noise annoyance, a formulated yet accurate method to quantify a subjective indicator silva_environmental_2015.
Many indices take a more differentiated path where specific weights modify existing data, making certain indicators have a stronger contribution on the final index score than other indicators. A common pathway of weighting indicators is Data Envelopment Analysis (DEA), which generally examines efficiency in how well an input/independent variable stimulates an output/dependent variable, especially in the context of many inputs and outputs. The Subjective Community Well-Being Indicator uses this approach, where social and financial aspects (e.g. health, job opportunities) are analysed against satisfaction in various corresponding fields bernini_dea-like_2013. Additionally, weights can be arbitrarily generated in sets, where each set undergoes sensitivity analysis to determine if it optimally formulates the UHI rothenberg_flexible_2014. Generating various iterations of weight sets is recommended for arbitrary weighting; Rothenburg et al.’s UHI uses n = 1000 sets, each undergoing sensitivity analysis.
Sometimes, weights are simply chosen on a relativistic scale, which is most often based on perceived significance and data reliability. Yale’s Environmental Performance Index subjectively perceives different indicators and their relationships with environmental status and policy, as well as weighing in data reliability (Block, 2024). For instance, weights are decreased for an indicator that has only one year of data or rendered more obsolete compared to other indicators with respect to its impacts on a certain outcome. Weights can also be attributed to statistical measures, which analyze indicator purpose to a more pure, objective lens. The Human Development/Poverty Indices in India determines weights through factor loadings, a measure of correlation between indicators and the outcome sourcing from SPSS analysis (Antony, 2007). This increases the significance of indicators that are more closely related to health outcomes compared to indicators that aren’t.
A more unique yet solid indicator construction technique is the Mazziato-Pareto Index (MPI) technique, which uses an arithmetic mean modulated by a penalizing factor. An arithmetic mean allows for non-arbitrary index determination, while the penalty adjustment (derived from indicator variability) prevents high indicator values from overshadowing low indicator values. Another technique is the Alkire-Foster (AF) method, which multiplies incidence by intensity (UNDP, 2023). Indicators are portrayed as binary, with specific hard thresholds marking whether or not the indicator contributes or does not contribute to the incidence. For example, in the Multidimensional Poverty Index, the poverty rate threshold is set such that only areas above that threshold would count - there is no soft gradient. Incidence is also determined by a hard threshold. Incidence is a proportion of areas that have a certain amount of indicators above their respective thresholds. Intensity is the average number of indicators above thresholds among areas that count toward incidence. The AF method provides a holistic viewpoint of not only where areas are affected, but how broader areas of studies are faring.
Other studies acknowledge the potential limitations on weighting indicators (especially differentially), citing that unequal weighting often resulted in rigid analysis across geographical areas with different trends and the faulty association of low correlation evidence with low influence. (Pineo, 2017). The practice of equating weights to certain statistical values (e.g. R2) can seem objective but masks temporal trends and socioeconomic trends with statistical values, which could incorrectly skew interpretations of data and compromise index integrity. Additionally, weighting loses some effectiveness when multiple outcomes are in play (i.e. multiple health diseases are studied), as a unique set of weights must be developed for each individual outcome richardson_developing_2010.
Regardless, most UHI development methodologies involve a filtration process to determine a concise, representative set of urban indicators to explain health trends (Pineo, 2018). Indicator filtration typically includes considerations of data availability, data recency, and clear purposes of data measurement (to avoid ambiguity in future interpretation). After a set of indicators is selected, weights are assigned to indicators in order to differentiate levels of impact on health outcomes (although weights can also be set equal to avoid limitations). Data values for each indicator are used in correspondence with their weights to create individual indicator values for each geographic area. (Pineo, 2017)
We first selected an optimal geographic scale (based on a territorial division). Scale is important to prevent sweeping over-generalisations (from a broader scale) or cases of overfitting (from a narrower scale). This indicated that ZIP codes, census tracts, or census blocks would be the most optimal scale of analysis. ZIP codes were ultimately rejected for the scale of analysis because they centre around mailing statistics, which could poorly explain demographic trends. Census blocks were not chosen because of high chances of overfitting and the lack of data accessibility (for both urban indicators and public health statistics).
To initialize analysis, we browsed through a set of X distinct urban indicators. This set was chosen on both objective and subjective criteria; while they must be measurable and accessible across all geographical areas (and on the census tract basis), they must also have a logical connection to public health, potentially eliminating possibilities of relationships produced by random chance. After this first analysis, seven indicators remained.
We subjectively selected response datasets under two main criteria: 1) They were common diseases in urban areas and 2) Their frequencies did not have high amounts of genetic influence (although some genetic influence is unavoidable). This criteria was selected in order to prevent unnecessary analysis on diseases that were not impacted or related to aspects of urban planning.
Because a wide array of explanatory variables could contribute to index development, we did not consider a rigid set of criteria for explanatory variable selection. Thus, explanatory variables were selected if they were related to urban planning and could have a logical connection to public health. Additionally, variables must be selected appropriately to prevent overfitting to a large set of variables. A total of 11 explanatory variables were initially selected for use. We obtained data in CSV format and integrated them through Microsoft Excel, where all data points were merged on a common census tract ID. Data plotting, however, indicated large groups of missing data, which sourced from the lack of CDC PLACES data from Florida. Any tracts with the “FL” Census denotation were removed from analysis. Additionally, while most variables had reasonable amounts of missing data, lakids1share (insert description), lahunv1share (insert description), and lapop1share (insert description) had great amounts of missing data, with around 40 percent of data missing for each variables. Any form of substitution for these missing data points would be greatly inaccurate as substitutions would be based on a small proportion of the sample size - therefore, these variables were also included from analysis.
Additionally, further data plotting indicated large groups of data points that contained “0”, especially for the est_ptrp, ht_ami, and est_vmiles variables. These variables are not suitable to be zero because miles travelled daily, vehicular trips daily, and the proportion of income spent on transportation and housing are not plausible to be zero in a census tract demographic. Because the frequency of zeroed values was not in great proportions, we replaced them with NaN values to be later imputed in a process described in detail below.
Additionally, to objectively center the analysis around urban areas, only urban and suburban tracts were selected. Rural tracts were eliminated based on a tract index derived from population and population density, which was established by the United States Department of Transportation’s Urbanicity Index (BTS, 2024). Rural tracts were identified as tracts that were not Urban Areas (UAs, over 50,000 people) or Urban Centers (UCs, between 2,500 and 50,000 people), meaning that they contained less than 2,500 people. This round of data filtration was also performed in Excel and resulted, alongside with imputation of NaN values, in 47092 census tracts across the nation. Additionally, through a triangular correlation matrix, indicators that shared major multicollinearity problems with other indicators (R2 > 0.90) were eliminated.
The final set of nine urban indicators allowed for different indicator permutations to be present in all 50 states of the United States. Permutations were chosen based on numerical values of each indicator’s respective variation inflation factor (VIF), which quantified multicollinearity between all chosen indicators in each state. If the maximum VIF value in the set of indicators was above 5, that indicator was removed. Indicators were removed one at a time because the removal of indicators results in lowered VIF scores for all other indicators, as there is less multicollinearity present in the set.
The response dataset was formed by six unique health factors: high blood pressure (BPHigh), high cholesterol (HighChol), coronary heart disease (CHD), chronic obstructive pulmonary disease (COPD), depression, and diabetes.
Data was collected on the census-tract scale, all from the year 2019 (data collected decenially were from 2010 instead of 2020 to avoid COVID-19 complications in the data).
Explanatory datasets were sourced from the National Neighborhood Data Archive at the University of Michigan (NaNDA), the US Census Bureau, the United States Department of Agriculture (USDA) Food Access Research Atlas, the United States Bureau of Transportation Statistics Local Area Transportation Characteristics for Households (LATCH), the Housing and Transportation Affordability Index (H+T), and the University of Minnesota’s Access Across America. NaNDA provided data on tract proportions of public transportation stops (stops) and park areas (park_area). The US Census Bureau provided a common thread of identification data between datasets, including state and population. The USDA Food Access Research Atlas provided data about food insecurity, measuring the proportion of a tract who had low access to food in a 0.5-mile radius (halfshare), as well as the percentage of families in a tract with low/no access to private vehicles (hunv). The LATCH provided data on vehicular usage, primarily in the form of person-trips (ptrp), the number of trips by one person in any form of transport in a day, or vehicle-miles (vmiles), the number of miles driven by a vehicle in a day multiplied by the passenger quantity. H+T provided data on employee gravity (emp_gravity), the total number of jobs in a tract divided by the square of its distance from the tract centroid, and the annual income spent on transportation and housing (ht_ami). Data from the Access Across America dataset related to job accessibility with respect to transit (transit), which measures the number of accessible jobs a person can reach through public transit services within an 1800-second threshold (a subjective selection that represents a tolerable commute time for an average worker).
Response datasets (health outcomes) were sourced from PLACES (Local Data for Better Health, Place Data 2020 release) by the Centers of Disease Control and Prevention. This 2020 PLACES release specifically contained model-based places, which include counties, places, census tracts, and ZIP Code tabulation Areas. PLACES data initially sourced from survey data, which are later analyzed in multilevel logistic regression models (CDC, 2022). That regression, which takes into account age, sex, race, education, and poverty rates, is responsible for estimating all six of the health metrics used in this paper. It is important to note that PLACES failed to tabulate data from Florida due to insufficient data tracking, causing this analysis to exclude Florida (CDC, 2022). In addition, it is also important to note that covariates are already accounted for within the production of the PLACES dataset. Therefore, it is unnecessary to control for demographic characteristics like age or sex when utilizing the PLACES dataset (Kong, 2020).
To develop a regression model, Python was utilised to create a program that takes a certain number of metrics and compares them to determine the statistical significance of proportionality and means between variables. Then, semi-partial correlation was utilised to find individual Pearson r-values that would eliminate covariate influence on the analysis. This variable analysis led to the development of the Urbhealth index.
The measure of person-trips was abbreviated to be ptrp throughout this analysis. A person-trip is defined by the United States Bureau of Transportation Statistics Local Area Transportation Characteristics for Households Survey (LATCH) as a trip made by a person in any mode of transportation – trips are generally subdivided into the categories of personal errands, social, work, church/school, work-related-business, and other. We utilized the LATCH dataset that was produced in 2017.
The measure of vehicle-miles was abbreviated to be est_vmiles throughout this analysis. A vehicle-mile is defined by the United States Bureau of Transportation Statistics LATCH Survey to be one vehicle traveling one mile. This means that this dataset is passenger-independent, as more than one passenger can ride in the same vehicle. (However, adjustments for trains and other multi-car modes of transportation are included.) We utilized the LATCH dataset that was produced in 2017.
The measure of the proportion of a census tract’s area that is designated to be a park was abbreviated to be ParkAreaProportion. This measure is produced by the National Neighborhood Data Archive (NaNDA) at the University of Michigan. This dataset utilized the census TIGER/Line shapefiles (2010) to accurately delineate census tract boundaries and, through ArcGIS, utilized a pairwise intersect to merge enclosed areas from those shapefiles and the ParkServe shapefiles, which provided geospatial data on parks. Then, area inside of enclosed areas that also include parks from the ParkServe shapefiles were used to calculate the proportion of parks per census tract (as a percentage) utilized in this analysis. We utilized the NaNDA dataset that was produced in 2018.
The measure of the amount of public transit stops within a census tract was abbreviated as StopsperSqMile throughout this analysis and was also sourced from NaNDA. Census tract delineation for this dataset utilized the TIGER/Line Shapefiles (2010), and data for the public transport stops was sourced from the National Transit Map (NTM). This data allowed for sufficient information to derive the count and density of transit stops within a census tract. We utilized the NaNDA dataset that covered the years between 2016 and 2018.
The measure of Housing and Transportation Area Median Income was abbreviated to ht_ami throughout this analysis. ht_ami is calculated to be the percentage of the income for the regional typical household (which is also the area median income) that is attributed to both housing and transportation costs. This calculation derives from the H+T Affordability Index. A regional typical household is defined to be a household income that is calculated as the median of the region. Additionally, for these households, regional average household size and commuters per household values are utilized. We utilized the dataset produced in 2022, which relies on 2019 data for its production.
The measure of employee gravity was abbreviated to empgravity throughout this analysis. Employee gravity is defined by the H+T Affordability Index as the total number of jobs in a tract divided by the square of its distance from the tract centroid. The tract centroid is the approximate geographic center of the geospatial census tract polygon. We utilized the dataset produced in 2022, which relies on 2019 data for its production.
The measure of noise exposure was specifically chosen to represent the population-weighted noise exposure, and is abbreviated nsaverage throughout the analysis. We utilized a formula to manually calculate the population-weighted noise exposure from the National Transportation Noise Exposure dataset provided by the University of Washington Deohs School of Public Health. This formula, instead of using specific decibel ranges of noise exposure, provides a single metric for each census tract by combining a weighted average of all of the decibel ranges. We utilized the dataset that was produced within a 2023 paper.
To begin data analysis, a level of territorial division had to be selected uniformly across the US. While states, counties, and city limits were too general and would not allow analysis of individual neighborhood and urban areas, census blocks were too specific and would provide overfitted data. Census tracts were chosen over ZIP codes because ZIP codes center around mailing and delivery statistics, making them a more inaccurate marker of demographic trends.
The initial sample size comprised all census tracts in the United States. However, because this study centres around urban planning, not all census tracts need to be studied. A dataset was utilized to determine if census tracks were urban, suburban, or rural based on their respective population densities. Since urban planning effects on rural areas are negligible, they were removed from the study. This did not significantly impact the sample size, as the census tract count remained over 40,000.
To finalise the datasets, multiple explanatory variables (e.g. public transportation frequency, park area, etc.) and response variables (e.g. Chronic obstructive pulmonary disease (COPD), obesity, etc.) that were tied to urban policy were explored. Statistics for these variables were found on the census-tract scale and aggregated from a singular dataset that housed all census tracts and their respective demographic data. Additionally, statistics were segregated based on their population density within the respective parameters shown in \creftab:base-stats.
At this stage, we had a finalized set of eight individual urban indicators and six health factors, which were all cleaned and imputed as mentioned in the Data Processing section. However, in order to further proceed into model development, we determined a methodology to filter the indicator subset to reduce computational strain and prevent overfitting. A standard correlation matrix was rejected to detect good fit because of a future objective to construct a non-linear model – correlation matrices typically use ordinary least squares (OLS) regression, which is a linear process. Thus, we picked mutual information to select indicators. Mutual information can be defined as how much information on one variable is given by another variable.
We computed the mutual information values in R by using the condinformation function from the infotheo library. By setting up a nested for loop to cycle through both indicator and health factor subsets, we compute all permutations of mutual information, and can consolidate those data points into a matrix as shown below:
Two variables had noticeably higher mutual information (MI) values than the others: est_vmiles and est_ptrp. This makes sense for several reasons. Since those two variables are inherently related (but not related enough to have a high multicollinearity, as seen in the original multicollinearity analysis), they should share consistent results. Also, health outcomes come in similar frequencies, especially since many can be related. Additionally, we noted that graphing a line graph for mutual information values from high to low would form a shape resembling exponential decay, which indicates that the majority of indicators would not have high mutual information values – that is, only a few indicators highly influence dependent variable behaviour.
We also noted that there were no standard thresholds for mutual information values in studied scientific literature: “In general, it is not straightforward to develop guidelines for cut-offs of MI indicating weak or strong relationships” (Young, 2023). The only indication of relative mutual information is when MI values are zero, which indicates the absence of relationships between the analysed independent variable and dependent variable. Therefore, we could not create a substantiated threshold for reference. Thus, we determined that the majority of dependent variable information can be explained, on the line graph, by the variables with MI values above the sharpest bend of the graph. This can be determined by inspection.
However, solving for MI values only rejects any variability in mutual information between states, as we consider states as a hierarchical variable later. Thus, we must use conditional information, where we can calculate the amount of information that one variable provides over the other when we are given a fixed value for a third variable; in this situation, the third variable is the state. By running a similar function in R with calculating MI values, except adding a function to condition on the “States” variable, we produced the following conditional mutual information (CMI) matrix:
Here, we also notice similar patterns to the MI value matrix, where the variables est_vmiles and est_ptrp are able to explain the most information for each health disease factor. We deemed this result to be reasonable as conditioning on the state should not heavily alter the amount of information explained.
The model that we chose to calculate our UHI, Urbhealth, was a hierarchical generalised additive model. First, we wanted to select a model that represented non-linear trends, which allows for non-linear relationships in data to be explored. Therefore, we selected an HGAM, which allowed us to form a custom polynomial as a summation of multiple non-linear functions for the individual indicators. A hierarchical version of a generalised additive model (GAM) was selected in order to establish hierarchies by state. We chose to establish these state hierarchies to produce a single common smoother among every model. in addition to state smoothers that have equal wiggliness. We selected this HGAM methodology to acknowledge that states should have similar template model shapes (as there are no drastic differences in disease frequencies across the United States); however, we also allowed for state-level smoothers to incorporate smaller variations in model shapes that could differentiate across states.
Since all variables studied are considered continuous (except the State variable, which when quantitatively indexed, is discrete), it is appropriate to represent all of the individual indicators as smooth functions, by the s() function in the “mgcv” library in R. Smooth functions were assigned to each urban indicator with the default bs = “tp” function, which produces thin plate splines (TPS). We selected TPSs as a model because they are “ideal for examining the combined effect of two continuous predictors on a single outcome”, which facilitated modelling between multiple urban indicators and individual health factors.
In order to factor in the State variable in the model, which is a discrete variable when indexed, we utilized the factor smooth (“fs”) model parameter within the mgcv library, where bs = “fs”. This model parameter was a modification of the proposed HGAM model parameters established by Pederson in the context of a global smoothing parameter that creates generalised shape while also accounting for geographic heterogeneity through smaller, adjusting smoothing terms for individual state groups (Pederson, 2019), which aligns with our goals of model construction stated in the above paragraphs in this section.
Multicollinearity, or significant evidence of correlation between multiple explored independent variables, was investigated by observing concurvity, a non-linear analogue to the variance inflation factor (VIF). We deemed it improper to use multicollinearity analysis for the HGAM model because multicollinearity observes overlapping linear trends, while this model utilized non-linear trends. This increases the chance of false multicollinearity analysis values that do not accurately reflect on the model shape; utilizing concurvity accommodates for that issue. Concurivty produces three separate analyses estimates by examining a ratio of a decomposed basis model g to the original smooth model f if g consists of the basis functions of all other variablex except the one analysed. Although there is no prevalent established threshold for model quality based on concurvity value, some sources indicate a threshold of 0.5 or lower indicates a viable model with low non-linear multicollinearity (Ramsay, 2003). Although the concurvity function produces three separate analyses, the “estimated” analysis was used for accurate model analysis, while the “worst” analysis was used for a rough benchmark of model performance.
Statistical significance in this project was analysed within the scholarly social sciences parameters, in which a standard R^2 between 0.10 and 0.50 is acceptable as long as each explanatory variables is significant relative to the response variable (p < 0.05).
Most states were not placed in clusters in order to maximize homogeneity in samples. However, ones that had low population/amount of census tracts were placed in geographic clusters.
Descriptive statistics
Model images
R2, p-values, significance of smooth terms
Discussion of FREML values
Table 1 displays the descriptive statistics for the data utilized in this statistical analysis. There were a total of 39,344 census tract samples that were utilized in this analysis. All disease statistics are reported in decimal proportions. Units for other data points are described in the Data Description section.
dvs <- c('CHD', 'Depression', 'COPD', 'DIABETES', 'HIGHCHOL', 'BPHIGH', 'Obesity')
data <- read_excel("/Users/dmach/Downloads/clean.xlsx") # non-cleaned data
#rename(data, PAProp = "ParkAreaProportion", StopsSqMile = "StopsperSqMile")
ivs <- c('ParkAreaProportion', 'StopsperSqMile', 'est_vmiles', 'est_ptrp', 'ht_ami', 'emp_gravity', 'nsaverage')
data <- data |>
dplyr::select(!starts_with("Unnamed")) |> # remove extra columns
filter(State != "FL") |> # remove Florida data
filter(emp_gravity < 300000) # remove emp_gravity single outlier
data$est_ptrp <- na_if(data$est_ptrp, 0) # replace 0 with NA
data$ht_ami <- na_if(data$ht_ami, 0)
data$TractID <- factor(data$TractID) # convert to factor variable
data$State <- factor(data$State)
data[dvs] <- data[dvs] / 100 # convert to percentage 0-1
data <- drop_na(data) # drop missing data
# knitr::kable(summary(data), format = "pipe")
table1 <-
data |>
tbl_summary(
statistic = all_continuous() ~ "{mean} ({sd})",
include = c(ivs, dvs)) |>
modify_caption("Table 1. Descriptive statistics of data.")
table1| Characteristic | N = 39,3441 |
|---|---|
| ParkAreaProportion | 0.04 (0.08) |
| StopsperSqMile | 16 (27) |
| est_vmiles | 34 (11) |
| est_ptrp | 8.16 (1.08) |
| ht_ami | 46 (11) |
| emp_gravity | 42,784 (44,931) |
| nsaverage | 53.42 (2.45) |
| CHD | 0.055 (0.018) |
| Depression | 0.21 (0.04) |
| COPD | 0.065 (0.025) |
| DIABETES | 0.11 (0.04) |
| HIGHCHOL | 0.34 (0.04) |
| BPHIGH | 0.31 (0.07) |
| Obesity | 0.34 (0.07) |
| 1 Mean (SD) | |
As mentioned in the Feature Selection section of the Methods section in this paper, the objective, quantitative measures of both mutual information (MI) and conditional mutual information (CMI). Figures 2.1 and 2.2 indicate the corresponding heatmaps for MI and CMI, respectively.
nbins <- sqrt(NROW(data))
numeric_cols <- c(ivs, dvs, 'Lat') # Columns to discretize
discretized_data <- data
for (col in numeric_cols) {
discretized_data[[col]] <- discretize(discretized_data[[col]], disc = "equalwidth", nbins = nbins)
}
cor_matrix <- cor(discretized_data[numeric_cols], use = "complete.obs")
# Print correlation matrix to understand relationships
# print(cor_matrix)
for (dv in dvs) {
for (iv in ivs) {
cond_info <- condinformation(discretized_data[[iv]], discretized_data[[dv]], discretized_data[['State']])
# print(paste("Conditional Information for", iv, "and", dv, ":", cond_info))
}
}
cmi_values <- matrix(0, nrow = length(ivs), ncol = length(dvs))
rownames(cmi_values) <- ivs
colnames(cmi_values) <- dvs
mi_values <- matrix(0, nrow = length(ivs), ncol = length(dvs))
rownames(mi_values) <- ivs
colnames(mi_values) <- dvs
for (i in 1:length(ivs)) {
for (j in 1:length(dvs)) {
mi_values[i, j] <- condinformation(discretized_data[[ivs[i]]],
discretized_data[[dvs[j]]])
}
}
# Melt the matrix for ggplot
mi_melted <- melt(mi_values)
colnames(mi_melted) <- c("IV", "DV", "MI")
# Create the heatmap
ggplot(mi_melted, aes(x = DV, y = IV, fill = MI)) +
geom_tile() +
geom_text(aes(label = round(MI, 3)), family = "Baskerville") +
scale_fill_gradient(low = "white", high = "red") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
text = element_text(family = "Baskerville", size = 14),
plot.title = element_text(hjust = 0.5, family = "Baskerville"),
plot.caption = element_text(hjust = 0.5)) +
labs(title = "Mutual Information (MI) Heatmap",
x = "Health Conditions",
y = "Environmental Factors",
fill = "MI",
caption = "Figure 2.1 — Heatmap of Mutual Information Values")nbins <- sqrt(NROW(data))
numeric_cols <- c(ivs, dvs, 'Lat') # Columns to discretize
discretized_data <- data
for (col in numeric_cols) {
discretized_data[[col]] <- discretize(discretized_data[[col]], disc = "equalwidth", nbins = nbins)
}
cor_matrix <- cor(discretized_data[numeric_cols], use = "complete.obs")
# Print correlation matrix to understand relationships
# print(cor_matrix)
for (dv in dvs) {
for (iv in ivs) {
cond_info <- condinformation(discretized_data[[iv]], discretized_data[[dv]], discretized_data[['State']])
# print(paste("Conditional Information for", iv, "and", dv, ":", cond_info))
}
}
cmi_values <- matrix(0, nrow = length(ivs), ncol = length(dvs))
rownames(cmi_values) <- ivs
colnames(cmi_values) <- dvs
for (i in 1:length(ivs)) {
for (j in 1:length(dvs)) {
cmi_values[i, j] <- condinformation(discretized_data[[ivs[i]]],
discretized_data[[dvs[j]]],
discretized_data[['State']])
}
}
# Melt the matrix for ggplot
cmi_melted <- melt(cmi_values)
colnames(cmi_melted) <- c("IV", "DV", "CMI")
# Create the heatmap
ggplot(cmi_melted, aes(x = DV, y = IV, fill = CMI)) +
geom_tile() +
geom_text(aes(label = round(CMI, 3)), family = "Baskerville") +
scale_fill_gradient(low = "white", high = "red") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
text = element_text(family = "Baskerville", size = 14),
plot.title = element_text(hjust = 0.5, family = "Baskerville"),
plot.caption = element_text(hjust = 0.5)) +
labs(title = "Conditional Mutual Information (CMI) Heatmap",
x = "Health Conditions",
y = "Environmental Factors",
fill = "CMI",
caption = "Figure 2.2 — Heatmap of Conditional Mutual Information Values")for (dv in dvs) {
cat("\n## ", dv, "\n\n")
cat("### Model Fitting\n\n")
rds_file <- paste0("fsmodel_", dv, ".rds")
if (!file.exists(rds_file)) {
cat(paste("RDS file", rds_file, "not found.\n\n"))
next
}
modG <- readRDS(file = rds_file)
#plot(modG, residuals = TRUE)
# TRIAL STUFF
pred_grid <- expand.grid(
ht_ami = seq(min(data$ht_ami), max(data$ht_ami), length.out = 100),
nsaverage = mean(data$nsaverage),
est_vmiles = mean(data$est_vmiles),
est_ptrp = mean(data$est_ptrp),
State = unique(data$State)
)
# Make predictions
preds <- predict(modG, newdata = pred_grid, se.fit = TRUE, type = "terms")
# Create a data frame for plotting
plot_data <- data.frame(
ht_ami = rep(pred_grid$ht_ami, times = length(unique(data$State))),
State = rep(unique(data$State), each = 100),
Effect = preds$fit[, "s(ht_ami,nsaverage,State)"],
SE = preds$se.fit[, "s(ht_ami,nsaverage,State)"]
)
# Create the plot
testplot <- ggplot(plot_data, aes(x = ht_ami, y = Effect, color = State)) +
geom_line() +
# geom_ribbon(aes(ymin = Effect - 1.96 * SE,
# ymax = Effect + 1.96 * SE,
# fill = State),
# alpha = 0.2) +
labs(x = "ht_ami", y = "Effect",
title = "Smooth effect of ht_ami by State") +
theme_minimal()
print(testplot)
cat("\n\n### Model Summary\n\n")
sum_modG <- summary(modG)
sum_modG_text <- capture.output(print(sum_modG))
cat("```\n")
cat(paste(sum_modG_text, collapse = "\n"))
cat("\n```\n\n")
ivlist <- list("est_vmiles", "est_ptrp", "ht_ami", "nsaverage")
cat("\n### Odds Ratios \n\n")
format_or_gam <- function(or_data, digits = 3, html = TRUE) {
# Round numeric columns
or_data[] <- lapply(or_data, function(x) if(is.numeric(x)) round(x, digits) else x)
kable(or_data,
format = "html",
digits = digits,
align = c('l', 'r', 'r', 'r', 'r')) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE,
position = "left") %>%
row_spec(0, bold = TRUE)
write.csv(or_data, paste0("/Users/dmach/Downloads/OR_", dv, ".csv"))
}
for (iv in ivlist) {
or <- or_gam(
data = data,
model = modG,
pred = iv,
percentage = 20,
slice = TRUE
)
f <- format_or_gam(or)
print(f)
}
cat("\n\n") # Add space between analyses
}| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -2.856596 | 0.0203139 | -140.6227 | 0 |
| edf | Ref.df | F | p-value | |
|---|---|---|---|---|
| s(est_vmiles) | 7.949261 | 8.685755 | 80.375989 | 0 |
| s(est_ptrp) | 7.695052 | 8.500817 | 171.384347 | 0 |
| s(ht_ami,nsaverage,State) | 430.234204 | 1435.000000 | 5.695203 | 0 |
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -1.208588 | 0.0203655 | -59.34484 | 0 |
| edf | Ref.df | F | p-value | |
|---|---|---|---|---|
| s(est_vmiles) | 8.797781 | 8.984942 | 113.39784 | 0 |
| s(est_ptrp) | 7.668288 | 8.502935 | 48.98440 | 0 |
| s(ht_ami,nsaverage,State) | 443.118985 | 1435.000000 | 29.97872 | 0 |
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -2.683025 | 0.0301939 | -88.8598 | 0 |
| edf | Ref.df | F | p-value | |
|---|---|---|---|---|
| s(est_vmiles) | 8.321323 | 8.852998 | 180.54031 | 0 |
| s(est_ptrp) | 6.756558 | 7.785455 | 81.92769 | 0 |
| s(ht_ami,nsaverage,State) | 458.012535 | 1435.000000 | 12.16770 | 0 |
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -2.155941 | 0.0340742 | -63.27189 | 0 |
| edf | Ref.df | F | p-value | |
|---|---|---|---|---|
| s(est_vmiles) | 8.691671 | 8.966153 | 547.96119 | 0 |
| s(est_ptrp) | 7.792261 | 8.568839 | 149.32333 | 0 |
| s(ht_ami,nsaverage,State) | 449.705301 | 1435.000000 | 13.43657 | 0 |
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.6855597 | 0.013391 | -51.19549 | 0 |
| edf | Ref.df | F | p-value | |
|---|---|---|---|---|
| s(est_vmiles) | 7.636344 | 8.518987 | 32.945223 | 0 |
| s(est_ptrp) | 8.094421 | 8.749189 | 145.548293 | 0 |
| s(ht_ami,nsaverage,State) | 402.664741 | 1435.000000 | 6.722335 | 0 |
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.7749023 | 0.0295856 | -26.19186 | 0 |
| edf | Ref.df | F | p-value | |
|---|---|---|---|---|
| s(est_vmiles) | 8.535775 | 8.930912 | 159.37032 | 0 |
| s(est_ptrp) | 7.638040 | 8.477366 | 128.37317 | 0 |
| s(ht_ami,nsaverage,State) | 433.253366 | 1435.000000 | 11.76405 | 0 |
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.6524397 | 0.030894 | -21.11865 | 0 |
| edf | Ref.df | F | p-value | |
|---|---|---|---|---|
| s(est_vmiles) | 8.520263 | 8.926150 | 237.70937 | 0 |
| s(est_ptrp) | 6.482026 | 7.564677 | 263.97430 | 0 |
| s(ht_ami,nsaverage,State) | 412.230868 | 1435.000000 | 20.49727 | 0 |
Tables 2.1-2.7 display the parameter estimates for the hierarchical generalised additive models. used in this analysis. Since there were multiple dependent variables analysed in this analysis, there are multiple tables to account for each dependent variable. The variables shown reflect the parsimonious feature selection process introduced in the Methods section of this paper.
| Statistics | Value |
|---|---|
| R-squared (adj) | 0.313 |
| Deviance explained | 0.322 |
| Number of observations | 39344 |
| FREML score | -52183 |
| Scale estimate | 0.0040158 |
| Statistics | Value |
|---|---|
| R-squared (adj) | 0.569 |
| Deviance explained | 0.573 |
| Number of observations | 39344 |
| FREML score | -51553 |
| Scale estimate | 0.0041262 |
| Statistics | Value |
|---|---|
| R-squared (adj) | 0.475 |
| Deviance explained | 0.484 |
| Number of observations | 39344 |
| FREML score | -47729 |
| Scale estimate | 0.0050164 |
| Statistics | Value |
|---|---|
| R-squared (adj) | 0.49 |
| Deviance explained | 0.489 |
| Number of observations | 39344 |
| FREML score | -38928 |
| Scale estimate | 0.0078459 |
| Statistics | Value |
|---|---|
| R-squared (adj) | 0.224 |
| Deviance explained | 0.226 |
| Number of observations | 39344 |
| FREML score | -42285 |
| Scale estimate | 0.0066422 |
| Statistics | Value |
|---|---|
| R-squared (adj) | 0.423 |
| Deviance explained | 0.421 |
| Number of observations | 39344 |
| FREML score | -27572 |
| Scale estimate | 0.0140045 |
| Statistics | Value |
|---|---|
| R-squared (adj) | 0.512 |
| Deviance explained | 0.51 |
| Number of observations | 39344 |
| FREML score | -30078 |
| Scale estimate | 0.0123063 |
Tables 3.1(1-3) to 3.7(1-3) display the concurvity estimates for the same models summarised in Tables 2.1 to 2.7, while Figures 3.1(1-3) to 3.7(1-3) present the numerical values shown in tables 3.1 to 3.7 in heatmap form.
for (dv in dvs) {
cat("\n## ", dv, "\n\n")
rds_file <- paste0("fsmodel_", dv, ".rds")
if (!file.exists(rds_file)) {
cat(paste("RDS file", rds_file, "not found.\n\n"))
next
}
modG <- readRDS(file = rds_file)
# save / load the concurvity files as RDS
# cv <- concurvity(modG, full = FALSE)
# saveRDS(cv, file = paste("concurvity_", dv))
rdsConc <- paste("concurvity_", dv)
if (!file.exists(rdsConc)) {
cat(paste("Concurvity data file", rdsConc, "not found.\n\n"))
next
}
cv <- readRDS(file = rdsConc)
tablefunct <- function(listMatrix, index) {
# assign proper type
if (index == 1) {
type <- " — Worst"
}
else if (index == 2) {
type <- " — Observed"
}
else {
type <- " — Estimate"
}
# now, print the output
cat(paste("### Concurvity", type, " \n\n"))
print(
knitr::kable(listMatrix, format = "html") |>
kable_styling("striped")
)
}
# make 3 separate tables: worst, observed, estimate (in order)
i <- 1
while (i <= 3) {
tablefunct(cv[[i]], i)
i <- i + 1
}
}| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0012278 | 0.0004125 | 1.0001644 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.4721100 | 0.6910873 |
| s(est_ptrp) | 0.0004125 | 0.4721100 | 1.0000000 | 0.4377259 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6909819 | 0.4377609 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0004568 | 0.0001226 | 0.0081149 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.4413361 | 0.0157010 |
| s(est_ptrp) | 0.0004125 | 0.4273857 | 1.0000000 | 0.0175864 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6070174 | 0.3923086 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0001072 | 0.0000761 | 0.0029092 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.3595512 | 0.0238952 |
| s(est_ptrp) | 0.0004125 | 0.2869857 | 1.0000000 | 0.0052191 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.5557620 | 0.3449646 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0012132 | 0.0004161 | 1.0004239 |
| s(est_vmiles) | 0.0012132 | 1.0000000 | 0.4721085 | 0.6919719 |
| s(est_ptrp) | 0.0004161 | 0.4721085 | 1.0000000 | 0.4377105 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6917463 | 0.4377186 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0000256 | 0.0000208 | 0.2838409 |
| s(est_vmiles) | 0.0012132 | 1.0000000 | 0.3634990 | 0.0561032 |
| s(est_ptrp) | 0.0004161 | 0.1862516 | 1.0000000 | 0.0798057 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6489298 | 0.3638193 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0001092 | 0.0000756 | 0.0029103 |
| s(est_vmiles) | 0.0012132 | 1.0000000 | 0.3595500 | 0.0238756 |
| s(est_ptrp) | 0.0004161 | 0.2869417 | 1.0000000 | 0.0052194 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.5559826 | 0.3451122 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0012278 | 0.0004125 | 1.0001644 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.4721100 | 0.6910873 |
| s(est_ptrp) | 0.0004125 | 0.4721100 | 1.0000000 | 0.4377259 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6909819 | 0.4377609 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0003812 | 0.0001311 | 0.0155275 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.2866162 | 0.1013943 |
| s(est_ptrp) | 0.0004125 | 0.4344971 | 1.0000000 | 0.1638668 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6118282 | 0.2630924 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0001072 | 0.0000761 | 0.0029092 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.3595512 | 0.0238952 |
| s(est_ptrp) | 0.0004125 | 0.2869857 | 1.0000000 | 0.0052191 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.5557620 | 0.3449646 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0012278 | 0.0004125 | 1.0001644 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.4721100 | 0.6910873 |
| s(est_ptrp) | 0.0004125 | 0.4721100 | 1.0000000 | 0.4377259 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6909819 | 0.4377609 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0003029 | 0.0000040 | 0.0166975 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.3300909 | 0.0367262 |
| s(est_ptrp) | 0.0004125 | 0.4165053 | 1.0000000 | 0.1033355 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6502921 | 0.3516378 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0001072 | 0.0000761 | 0.0029092 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.3595512 | 0.0238952 |
| s(est_ptrp) | 0.0004125 | 0.2869857 | 1.0000000 | 0.0052191 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.5557620 | 0.3449646 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0012147 | 0.0003975 | 1.0001701 |
| s(est_vmiles) | 0.0012147 | 1.0000000 | 0.4721085 | 0.6913076 |
| s(est_ptrp) | 0.0003975 | 0.4721085 | 1.0000000 | 0.4381736 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6910991 | 0.4380013 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0002238 | 0.0001375 | 0.0003318 |
| s(est_vmiles) | 0.0012147 | 1.0000000 | 0.3844476 | 0.1813834 |
| s(est_ptrp) | 0.0003975 | 0.0522158 | 1.0000000 | 0.1056478 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.2538012 | 0.3378970 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0001054 | 0.0000764 | 0.0029105 |
| s(est_vmiles) | 0.0012147 | 1.0000000 | 0.3596320 | 0.0238716 |
| s(est_ptrp) | 0.0003975 | 0.2858566 | 1.0000000 | 0.0052194 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.5561047 | 0.3456529 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0012278 | 0.0004125 | 1.0001644 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.4721100 | 0.6910873 |
| s(est_ptrp) | 0.0004125 | 0.4721100 | 1.0000000 | 0.4377259 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6909819 | 0.4377609 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0003662 | 0.0001494 | 0.0225863 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.3582694 | 0.0088022 |
| s(est_ptrp) | 0.0004125 | 0.3425023 | 1.0000000 | 0.0610935 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6276611 | 0.3202603 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0001072 | 0.0000761 | 0.0029092 |
| s(est_vmiles) | 0.0012278 | 1.0000000 | 0.3595512 | 0.0238952 |
| s(est_ptrp) | 0.0004125 | 0.2869857 | 1.0000000 | 0.0052191 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.5557620 | 0.3449646 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0011816 | 0.0004049 | 1.0000061 |
| s(est_vmiles) | 0.0011816 | 1.0000000 | 0.4721073 | 0.6909060 |
| s(est_ptrp) | 0.0004049 | 0.4721073 | 1.0000000 | 0.4378509 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6908953 | 0.4377423 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0003793 | 0.0000299 | 0.0234209 |
| s(est_vmiles) | 0.0011816 | 1.0000000 | 0.4127897 | 0.0924731 |
| s(est_ptrp) | 0.0004049 | 0.4277435 | 1.0000000 | 0.2086027 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.6095835 | 0.4143929 | 1.0000000 |
| para | s(est_vmiles) | s(est_ptrp) | s(ht_ami,nsaverage,State) | |
|---|---|---|---|---|
| para | 1.0000000 | 0.0001053 | 0.0000746 | 0.0029094 |
| s(est_vmiles) | 0.0011816 | 1.0000000 | 0.3595452 | 0.0238835 |
| s(est_ptrp) | 0.0004049 | 0.2857350 | 1.0000000 | 0.0052188 |
| s(ht_ami,nsaverage,State) | 1.0000000 | 0.5560765 | 0.3451523 | 1.0000000 |
Prevalence can be derived from the partial effects graphs of the GAMs shown in the Models section with the following formula:
prevalence = exp(linear_predictor) / (1 + exp(linear_predictor))